4 Main Analysis
Note: the data is zipped in the data folder when downloaded and must be unzipped in order to read
Exloratory Data Analysis
This file explores the cleaned version of the dataset provided by the ‘part4_cleaning.rmd’ file.
To begin, we read in the cleaned data as well as some additional data we will compare against later on. For this analysis, we binned the shapes into five categoriexs.
ufo <- as_tibble(read_csv("data/ufo_clean_final.csv"))
ufo <- ufo %>% select(-X1, -X1_1)
state_pops <- as_tibble(read_csv("data/StatePops.csv"))
excer <- as_tibble(read.csv("data/Excercise.csv"))
colnames(excer)[2] <- "Excercise"
trump_state <- as_tibble(read_csv("data/TrumpState.csv"))
obese <- as_tibble(read_csv("data/Obesity.csv"))
cbPalette <- c("#56B4E9", "#F0E442", "#E69F00", "#009E73", "#CC79A7", "#D55E00", "#999999", "#0072B2")
ufo <- ufo %>%
mutate(Shape2 = fct_relevel(Shape2, "Circle", "Light", "Triangle", "Rectangle", "Other"))
NA NA NA NA… Missing Data Analysis
As seen in the data quality report, nearly all sightings reported contain complete data, and the observations missing some data are mostly just missing a single field value.
visna(ufo[, c('Duration','Shape','State','Country','Time','City','Desc','Summary' )], sort = "b")
We considered the possibility that international reports may be more likely to have missing data. However, we found that this is not the case. Missing data is consistent across reporting countries.
percent_missing <- ufo %>% group_by(`In_USA`) %>%
summarize(num = n(), num_na = sum(is.na(`Shape`))) %>%
mutate(percent_na = round(num_na/num, 2)) %>%
arrange(-percent_na)
percent_missing
## # A tibble: 2 x 4
## In_USA num num_na percent_na
## <lgl> <int> <int> <dbl>
## 1 FALSE 13268 1364 0.1
## 2 TRUE 101570 10253 0.1
We also wondered whether any component of the sighting itself would impact what data was omitted when filing the report. Here, we consider whether the ‘Shape’ of the object has any impact on missing data, and found that there is little variance between shapes.
percent_missing <- ufo %>% group_by(`Shape1`) %>%
summarize(num = n(), num_na = sum(is.na(`Duration`))) %>%
mutate(percent_na = round(num_na/num, 2)) %>%
arrange(-percent_na)
ggplot(percent_missing, aes(x=Shape1, y=percent_na)) +
geom_bar(stat='sum',fill='dodgerblue') +
xlab('Shape') + ylab('% Missing') +
theme_light() + ggtitle('Percent of Missing Values by Shape') + theme(legend.position = 'none') + scale_y_continuous(labels = percent)
Review Sightings by State
Here we split the duration into discrete bins to make future plotting easier
USA_ufo <- ufo %>% filter(In_USA == TRUE & is.na(Shape2) == FALSE)
bin_duration <- function(duration){
if(is.na(duration) == TRUE){
return(NA)
}
if(duration < 60){
return("< 1")
}
else if(duration < 300){
return("1 - 5")
}
else if(duration < 600){
return("5 - 10")
}
else if(duration < 1800){
return("10 - 30")
}
else if (duration < 3600){
return("30 - 60")
}
else{
return("60+")
}
return(NA)
}
Unidentified Flying Graphs
ufo <- ufo %>% mutate(`Bin_Dur` = lapply(ufo$Duration, bin_duration))
ufo$Bin_Dur <- unlist(ufo$Bin_Dur)
ufo <- ufo %>%
mutate(Bin_Dur = fct_relevel(Bin_Dur, "< 1", "1 - 5", "5 - 10", "10 - 30", "30 - 60", "60+"))
As we would expect, the number of reported sightings increases with the availability of the internet, especially after the creation of the NUFORC website in 1998.
ggplot(ufo, aes(x = year)) + geom_bar(fill='dodgerblue') + scale_y_continuous(name = "Sightings (thousands)", breaks = c(2000,4000,6000,8000), labels = c(2,4,6,8))+
ggtitle("Sighting Frequency (Full Dataset Timescale)") + theme_light()
ufo1950 <- ufo %>% filter(year>=1950)
ggplot(ufo1950, aes(x = year)) + geom_bar(fill='dodgerblue') +
ggtitle("Sighting Frequency (1950-2018)") + theme_light() + scale_y_continuous(name = "Sightings (thousands)", breaks = c(2000,4000,6000,8000), labels = c(2,4,6,8))
As seen below, the most frequent shape of observations in the data are circular shaped observations with light as a close second. This is consistent with the long portrayed image of a ‘UFO’ in pop culture. Many science fiction media portay UFOs as saucer type objects that have lights, often flashing. The third most common shape is triangluar. This is consistent with a more recent phenomenon in the UFO sighting community known as the Black Triangle theory (https://www.space.com/302-silent-running-black-triangle-sightings-rise.html)
ggplot(ufo, aes(x = Shape2)) + geom_bar(fill='coral2') +
ggtitle("Sighting Frequency by Observed Shape") +
xlab("Shape") + scale_y_continuous(name='Sightings (thousands)', labels = c(10,20,30),breaks = c(10000,20000,30000)) + theme_light()
Most of the sightings in the dataset are fairly short (between seconds and 5 minutes); however, there are a significant amount of sightings of all durations.
ggplot(ufo, aes(x = Bin_Dur)) + geom_bar(fill='darkcyan') + labs(x = "Duration in Minutes", y = "Number of Sightings") +
ggtitle("Sighting Frequency by Duration") + scale_y_continuous(name='Sightings (thousands)', labels = c(10,20,30),breaks = c(10000,20000,30000)) + theme_light()
Zooming in to the sightings less than one hour, we can make a few observations.
- Duration is heavily right skewed, with most sightings last less than 5 minutes
- There is evidence of rounding in reporting duration with nearly all values greater than 10 minutes, rounded to 5 minute intervals. This trend is even more evident when looking at the 30 min, 45 min, and 60 min intervals.
ufodurclip <- ufo %>% filter(is.na(Duration) == FALSE & Duration <= 3600)
ggplot(ufodurclip, aes(x = Duration)) + geom_histogram(, color="black", bins=60,fill="springgreen3") +
ggtitle("Duration Histogram, Limited to 1 Hour") + scale_y_continuous(name='Sightings (thousands)', labels = c(10,20,30),breaks = c(10000,20000,30000)) + scale_x_continuous(name = "Duration (minutes)", breaks = seq(0,3600,300), labels = seq(0,60,5)) + theme_light()
Duration of observation is independent of the shape. That is, UFOs of circular shape are no more or less likely to be seen for a longer amount of time than, say, triangular UFOs.
temp <- ufo %>% filter(year >= 1995 & year != 2018 & is.na(Bin_Dur) == FALSE) %>% group_by(Bin_Dur) %>% summarize(durfreq = n())
durshape <- ufo %>% filter(year >= 1995 & year != 2018 & is.na(Bin_Dur) == FALSE) %>% group_by(Bin_Dur, Shape2) %>% summarize(durshapefreq = n())
durshape <- merge(durshape, temp, by="Bin_Dur")
durshape$prop <- durshape$durshapefreq/durshape$durfreq
ggplot(durshape, aes(x = Shape2, y = prop, fill = Shape2)) +
geom_bar(stat = "identity") + facet_wrap(~Bin_Dur) +
ggtitle("Shape Proportion Across Durations") +
xlab("Duration in Minutes") + scale_fill_manual(values=cbPalette, name = "Shape") + scale_y_continuous(labels = percent) + ylab("Proportion of Sightings")+ theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme_light()
As expected, the majority of UFO sighting happen between 6pm and midnight. Late enough that it’s dark, but early enough that people haven’t gone to sleep yet. We choose to use 24 bins (one for each hour of the day) to smooth the effects of rounding in the trend. We can also notice a bump at 12noon, indicating typical lunch break when people are more inclined to be outside.
ggplot(ufo, aes(as.POSIXct(Time))) +
geom_histogram(bins=24, color="black", fill="springgreen3") +
ggtitle("UFO Sighting Frequency Over Time") +
labs(y = "Frequency") +
# scale_x_datetime(breaks = date_breaks("1 hour"))
scale_x_datetime(breaks = date_breaks("2 hour"), labels = date_format("%H:%M")) +
xlab("Time of Day") +
scale_fill_viridis(name = "Frequency") + theme_light()+ scale_y_continuous(name='Sightings (thousands)', labels = c(2,4,6,8,10,12,14),breaks = c(2000,4000,6000,8000,10000,12000,14000))
We can also observe that there are significantly more sightings during the warmer months of the year (noting that all of the reports are in the northern hemisphere). We suspect that these are the months where people are more inclined to spend time outside and therefore have more opportunity to observe the UFOs.
Now lets focus on the ‘description’ field. This field is rightfully free form, and allows us to do some text-based analysis. First, we will look at how descriptive the reports are and if there are any other features that entice reporters to provide longer descriptions. Later (see part67.rmd) we consider the word frequency and two-word affinity within comments.
We can see that descriptions tend to be 200 words or less with a mean around 100 words, but there is a significant percentage of descriptions that exceed 400 words.
ufoShortDescription <- filter(ufo, DescriptionLength < 1000)
ggplot(ufoShortDescription, aes(DescriptionLength)) +
geom_histogram(bins=70, color="black", fill="firebrick3") +
ggtitle("UFO Sighting Description Word Count by time of Day and Shape") +
labs(y = "Frequency", x = "Word Count") + scale_fill_viridis(name = "Frequency") + scale_y_continuous(name = "Sightings (thousands)", breaks = seq(2000,8000,2000), labels = seq(2,8,2)) + theme_light() + scale_x_continuous(name = "Word count", breaks = seq(0,1000,100), labels = seq(0,1000,100))
We considered whether observing certain shapes would encourage a longer description. When looking for this trend, we actually uncovered a separate, but equally interesting one. During the daylight hour observations, there is a clear drop off of “light” observations. This seems obvious due to lighted objects being more difficult to spot during daytime. This trend can be seen as the color gradient changes over time in the scatter plot below. At night, as we saw earlier, there is an increased frequency of reports, and the distribution of shapes is more unifrom.
ggplot(ufoShortDescription, aes(x=as.POSIXct(Time), y=DescriptionLength)) +
geom_point(alpha=0.2, aes(color = Shape2)) +
ggtitle("Word Count by time of Day and Shape") +
labs(y = "Word Count", x = "Time") +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
scale_color_manual(values=cbPalette, name = "Shape") +
scale_x_datetime(breaks = date_breaks("2 hour"), labels = date_format("%H:%M")) + theme_fivethirtyeight()
To reinforce the above trend, when faceting the same information, we can see that circles are more prominent during the day, and light sightings are more frequent at night.
temp8 <- ufoShortDescription %>% filter(year >= 1995 & year != 2018 & is.na(Shape2) == FALSE) %>% group_by(Time, Shape2) %>% summarize(freq = n())
ggplot(temp8, aes(Time, freq, color = Shape2)) + geom_line() + facet_wrap(~Shape2) +
ggtitle("UFO Sighting Shapes Over Time of Day") +
labs (x = "Time", y = "Sightings") + scale_color_manual(values=cbPalette, name = "Shape") + theme_fivethirtyeight()
Geospatial Analysis
Before pursuing any geography based analysis, we must control for population density, as most of the most popular states (California) have large numbers of sightings due to it’s size. This is seen in the graph below showing absolute number of sightings per state.
ufostate <- ufo %>% filter(is.na(State) == FALSE & In_USA == TRUE)
ggplot(ufostate, aes(x = State)) + geom_bar() +
ggtitle("Sightings by State (no population correction)") + theme_light() + theme(axis.text.x=element_text(angle=60, hjust = 1)) + ylab('Sightings')
Now that we have corrected for population density, lets consider the states with the highest per-capita sightings reported - WA, VT, MT, AK, OR, and ME. We can notice that 4 of these states are on the coasts. Additionally, there seems to be a west coast bias to frequency of observations. We rationalize this by noting that the headquarters of NUFORC is in Washington state. We also observe that these states are fairly rural. In the next section, we will see that these states favor outdoor activities, unlike other rural states such as Alabama, Texas, and Georgia, where people tend to spend their time indoors.
USA_ufo_merged <- merge(USA_ufo, state_pops, by = "State")
USA_ufo_summary <- USA_ufo_merged %>% filter(year >= 1995) %>% group_by(State, Shape2, Population) %>% summarise(freq = n()) %>% mutate(`Prop` = 10000*freq/Population)
temp5 <- USA_ufo_merged %>% filter(year >= 1995) %>% group_by(State, Population) %>% summarise(freq = n()) %>% mutate(`Prop` = 10000*freq/Population)
temp5$state <- temp5$State
statebins(temp5, value_col = "Prop", text_color = "black", font_size = 3, legend_title = "Sightings Per 10,000 Residents", legend_position = "bottom") + labs(title = "The Coasts are More Supernatural")
ggplot(USA_ufo_summary, aes(x = reorder(State, -Prop), y = Prop)) + geom_bar(stat = "identity", fill = 'navy') + labs(y = "Sightings per 10,000 Residents", x = "State") +
ggtitle("Sightings by State, Population Corrected") + theme_light() + theme(axis.text.x=element_text(angle=90, hjust = 1))
Extraterrestrial Correlations
We use three different proxies to closely explore the correlation between amount of time spent outside with frequency of UFO sightings:
- Hours a day spent doing Sports, Excercise, and Recreation (https://www.bls.gov/spotlight/2017/sports-and-exercise/home.htm)
- Percentage of Population Engaged in Sports and Recreation daily (http://www.governing.com/topics/urban/gov-americans-time-use-survey-2015.html )
- Obesity rate (https://stateofobesity.org/adult-obesity/).
As there is no measure for how much time people spend outside, we believe that these 3 proxies are likely correlated with outdoor activity.
All three of these graphs show a clear correlation. Though there are many individual states that do not follow this pattern, overall the trend is clear - more exercise is correlated with more UFO sightings and a lower percentage of obese adults is correlated with more UFO sightings.
USA_ufo_summary2 <- USA_ufo_merged %>% filter(year >= 2009) %>% group_by(State, Population) %>% summarise(freq = n()) %>% mutate(`Prop` = 10000*freq/Population)
USA_ufo_summary2 <- merge(USA_ufo_summary2, excer, by = "State")
ggplot(USA_ufo_summary2, aes(x = ATUS, y = Prop)) + geom_point(stat = "identity") + labs(x = "Minutes a day spent doing Sports, Excercise, Recreation", y = "Sightings per 10,000 Residents") + geom_text_repel(label=USA_ufo_summary2$State)+ geom_smooth(method='lm',formula=y~x) + theme_light() + scale_x_continuous(breaks = seq(0,.7,.1),labels = seq(0,42,6)) +
ggtitle("Sightings by Exercise (2009 - Present)")
ggplot(USA_ufo_summary2, aes(x = Excercise, y = Prop)) + geom_point(stat = "identity") + labs(x = "Percentage of population engaged in sports and exercise on an average day", y = "Sightings per 10,000 Residents") + geom_text_repel(label=USA_ufo_summary2$State) + geom_smooth(method='lm',formula=y~x) + theme_light() + ggtitle("Sightings by Population engaged in Sports/Exercise (2009 - Present)")
USA_ufo_summary2 <- merge(USA_ufo_summary2, obese, by = "State")
ggplot(USA_ufo_summary2, aes(x = Obesity, y = Prop)) + geom_point(stat = "identity") + labs(x = "Percentage of Obese Adults", y = "Sightings per 10,000 Residents") + geom_text_repel(label=USA_ufo_summary2$State) + geom_smooth(method='lm',formula=y~x) + theme_light() + ggtitle("Sightings Correlated with Obesity (2009 - Present)")
Given this evidence, we have confidence that states where residents are more likely to spend time outside are also more likely to observe UFOs. This seemingly trivial result confirms our common sense perception that you have to spend time looking at the sky to see something unidentifiable.
We wanted to see if other state-by-state features were correlated with sightings per 10,000 residents, including political affiliation. Filtering for obervations in the last four years, we can see a slight negative trend with Donald Trump’s 2016 election margin by state. Though this is open for interpretation, in our opinion, we find it likely that Trump voters are much likely to spend time outside/exercising.
USA_ufo_summary3 <- USA_ufo_merged %>% filter(year >= 2014 & State != "DC") %>% group_by(State, Population) %>% summarise(freq = n()) %>% mutate(`Prop` = 10000*freq/Population)
USA_ufo_summary3 <- merge(USA_ufo_summary3, trump_state, by = "State")
ggplot(USA_ufo_summary3, aes(x = `Trump Margin`, y = Prop)) + geom_point(stat = "identity") + labs(x = "Percentage of Popular Vote for Donald Trump in 2016", y = "Sightings per 10,000 Residents") + geom_text_repel(label=USA_ufo_summary3$State) + geom_smooth(method='lm',formula=y~x) + theme_light() + ggtitle("Sightings correlated with Political Leaning (2014 - Present)")
Exploring Trends Over Time
The next phase of exploration was looking at our data over time. There is a clear change in the proportion of the shapes of observations since 1950. Long ago a majority of UFO sightings were circles, but over time the proportion of circles clearly shrank, with an marked increase in light, triangles, and rectangles. The spike in NAs are associated with paper logged reports that were backfilled into the database when the site came online in 1998. We show the graph with and without NAs.
ggplot(ufo1950, aes(x = year, fill = Shape2)) + geom_bar(position = "fill") + labs(x='Year', y="Relative Frequency") + ggtitle("Relative Frequency of Shapes over Time")+ scale_fill_manual(values = cbPalette, name = 'Duration') + scale_y_continuous(labels = percent)
ufo1950_filtered <- ufo1950 %>% filter(is.na(Shape2) == FALSE)
ggplot(ufo1950_filtered, aes(x = year, fill = Shape2)) + geom_bar(position = "fill") + labs(x='Year', y="Relative Frequency") + ggtitle("Relative Frequency of Shapes over Time (NA excluded)")+ scale_fill_manual(values = cbPalette, name = 'Duration') + scale_y_continuous(labels = percent)
In graphing the duration of sightings over time, we can see that duration seems to decrease over time, with a larger proportion of sightings less than 5 minutes. Before 1980 the data is more varied, but as we get closer to 1998 and especially after that, things seemed to even out.
ufo1950 <- ufo1950 %>% filter(is.na(Bin_Dur) == FALSE)
ggplot(ufo1950, aes(x = year, fill = Bin_Dur)) + geom_bar(position = "fill") + scale_fill_manual(values = cbPalette, name = 'Duration') + labs(x = "Year", y='Relative Frequency') + ggtitle('Relative Frequency of Sighting Duration')+ scale_y_continuous(labels = percent)
This graph exemplifies how the proportion of durations varies wildly before 1995, but after 1995 the proportion of durations of observations became less variable.
temp2 <- ufo %>% filter(year >= 1950 & year != 2018 & is.na(Shape2) == FALSE) %>% group_by(year) %>% summarize(yearfreq = n())
temp2 <- merge(ufo, temp2, by = "year")
shape_overtime <- temp2 %>% filter(year >= 1950 & year != 2018 & is.na(Shape2) == FALSE) %>% group_by(year, Shape2, yearfreq) %>% summarize(freq = n()) %>% mutate(`Prop` = freq/yearfreq)
ggplot(shape_overtime, aes(year, Prop, color = Shape2)) + geom_line() +
ggtitle("Saucers Become Out of Style") +
labs (x = "Year", y = "Proportion of Sightings") + scale_color_manual(values=cbPalette, name = "Shape")+ scale_y_continuous(labels = percent)
temp3 <- ufo %>% filter(year >= 1950 & year != 2018 & is.na(Bin_Dur) == FALSE) %>% group_by(year) %>% summarize(yearfreq = n())
temp3 <- merge(ufo, temp3, by = "year")
dur_overtime <- temp3 %>% filter(year >= 1950 & year != 2018 & is.na(Bin_Dur) == FALSE) %>% group_by(year, Bin_Dur, yearfreq) %>% summarize(freq = n()) %>% mutate(`Prop` = freq/yearfreq)
ggplot(dur_overtime, aes(year, Prop, color = Bin_Dur)) + geom_line() +
ggtitle("UFO Sighting Durations Over Time") + labs(x = "Year", y = "Proportion of Sightings") + theme_light() + scale_color_manual(values=cbPalette, name = "Shape")+ scale_y_continuous(labels = percent)
A summary of the above finding, the average duration over time can clearly be seen to be more variable prior to around 1995, where it became much more stable. There is also an overal decrease in sighting duration.
truedur_overtime <- ufo %>% filter(year >= 1950 & year != 2018 & is.na(Duration) == FALSE & Duration <= 3600) %>% group_by(year) %>% summarize(AvgDur = mean(Duration) )
ggplot(truedur_overtime, aes(year, AvgDur)) + geom_line() +
ggtitle("UFO Sighting Durations Over Time") + labs(x = "Year", y = "Mean Duration (Seconds)") + theme_light()